Overview

This model grades the quality of ball contact. The quality metric is play_result, since it provides an outcome against which the success of contact can be graded: a single, for example, indicates higher-quality contact than an out. Ultimately, play_result reveals whether a hit created a scoring opportunity, which is a good indicator of quality. The model predicts play_result so that the outcomes of future hits can be predicted from a set of features. It also reveals which variables matter most to the result of a hit.

See also: Instead of using one of the data columns as the quality metric, I had an additional idea for a weighted metric. A hit's quality is not defined solely by its outcome, and the different variables do not affect quality equally. A notebook exploring that idea is also in the repo.

Loading Libraries & Data

# Load necessary libraries
library(here)
## here() starts at C:/Users/grack/OneDrive/Documents
library(ggplot2)
library(dplyr)
library(tidyverse)
library(plotly)
library(nnet)

# Read the CSV into df
df <- read.csv(here("baseball_data.csv"))

Data Cleaning & Manipulation

# Data Cleaning
# Replace blank hit_type values with "foul_ball" (blanks are assumed to occur only when play_result = "Foul Ball")
df$hit_type <- replace(df$hit_type, df$hit_type == "", "foul_ball")
# Drop rows with no recorded play result
df <- df %>% filter(play_result != "")

# Turn categorical variables into factors so they can be used within the model

df$batter_side <- as.factor(df$batter_side)
df$pitcher_throws <- as.factor(df$pitcher_throws)
df$hit_type <- as.factor(df$hit_type)
df$play_result <- as.factor(df$play_result)
df$venue <- as.factor(df$venue)

# Turn bearing and angle into flags rather than numbers, to avoid working with negative values and to simplify the model.

# Bearing now reads "right_field" if the ball is hit to right field (bearing > 0), otherwise "left_field"
df <- mutate(df, bearing = ifelse(bearing > 0, "right_field", "left_field"))
# Angle now reads "upward" if the ball is traveling upward (angle > 0), otherwise "downward"
df <- mutate(df, angle = ifelse(angle > 0, "upward", "downward"))
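
One detail of the flagging step worth checking is how boundary and missing values are treated. The sketch below uses a made-up vector of bearings (not the notebook's data) to show that a value of exactly 0 falls into the "otherwise" label and that NA stays NA.

```r
# Hypothetical bearing values, chosen only to illustrate edge cases
bearing <- c(-10, 0, 15, NA)

# Same rule as in the cleaning step: strictly positive means right field
flags <- ifelse(bearing > 0, "right_field", "left_field")

# 0 is labeled "left_field"; NA is carried through as NA
print(flags)
```

If a ball hit straight up the middle should be treated separately, the rule would need a third label; the notebook's two-level flag does not make that distinction.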

Picking Features and EDA

play_result is a measure of success, so it is the quality metric for this model. Quality hits are those that produce a positive play result, such as a single. Features are chosen based on their relationship to play_result. I chose the following features to investigate: exit speed, pitch type, hit type, bearing, and angle.

Exit Speed

A high exit speed makes a hit more likely to result in an effective play such as a home run, single, double, or triple. The mean exit speed for a home run is about 104 mph, versus about 75 mph for a foul ball. High exit speed is associated with positive play results.

## [1] "Foul Ball Mean Exit Speed: 75.446477704323"
## [1] "Home Run Mean Exit Speed: 104.384251524052"
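
A comparison like the one above can be produced with a grouped mean. The following is a minimal sketch on made-up rows, assuming the labels "Foul Ball" and "Home Run" appear in play_result as they do in the printed output; it is not necessarily the code that generated those lines.

```r
# Toy data standing in for the notebook's df (values are invented for illustration)
toy <- data.frame(
  play_result = c("Foul Ball", "Foul Ball", "Home Run", "Home Run", "Out"),
  exit_speed  = c(70, 80, 102, 106, NA)
)

# Mean exit speed per play result; the formula interface drops rows with NA by default
mean_speed <- aggregate(exit_speed ~ play_result, data = toy, FUN = mean)

# Print one line per play result
for (i in seq_len(nrow(mean_speed))) {
  print(paste0(mean_speed$play_result[i], " Mean Exit Speed: ", mean_speed$exit_speed[i]))
}
```

Because the "Out" row has no exit speed, it is dropped before averaging, so only two groups are reported here.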

Pitch Type

Four-seam fastballs and sliders are the most frequently thrown pitch types and are the most likely to result in a foul ball or an out. Pitches that are more difficult to hit are likely thrown most often to maximize effectiveness. Pitch type is related to play result.

(Interactive plot: highlight a pitch type to view the data in more detail.)

Hit Type

Bearing

## 
##  Pearson's Chi-squared test
## 
## data:  df$play_result and df$bearing
## X-squared = 752.33, df = 8, p-value < 2.2e-16

Field position plays a role in the likelihood of each play result. A ball hit to left field is more likely to result in a single, a double, or a home run than one hit to right field. Hits to left field are also less likely to result in an out or a foul ball. In addition, the chi-square test gives a p-value below 0.05, so bearing and play result are related.
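
Statements such as "more likely to result in a single" compare shares within each field side. A small sketch of that comparison, using invented rows rather than the notebook's data, is a row-normalized cross-tabulation:

```r
# Toy rows standing in for df$bearing and df$play_result (invented for illustration)
toy <- data.frame(
  bearing     = c("left_field", "left_field", "left_field", "left_field",
                  "right_field", "right_field", "right_field", "right_field"),
  play_result = c("Single", "Single", "Out", "Home Run",
                  "Out", "Out", "Foul Ball", "Single")
)

# Counts of each play result by field side
counts <- table(toy$bearing, toy$play_result)

# Convert counts to shares within each field side (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Reading across a row gives the distribution of results for that side of the field, which is the quantity the comparison in the text relies on.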

Angle

Upward launch angles are the more frequent outcome of ball contact, so a chi-square test is used to confirm that angle is associated with play result.

## 
##  Pearson's Chi-squared test
## 
## data:  df$play_result and df$angle
## X-squared = 7732.2, df = 8, p-value < 2.2e-16

Since the p-value is less than 0.05, we can conclude that angle and play result are dependent.
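
The chi-square tests reported for bearing and angle can be sketched on a small table of counts. The counts below are invented so the arithmetic is easy to follow; with two angle levels and three results, the degrees of freedom are (2 - 1) x (3 - 1) = 2, in the same way that df = 8 in the output above is what a 2-by-9 table would give.

```r
# Hypothetical counts: rows are angle flags, columns are play results
counts <- matrix(
  c(30, 10, 20,
    10, 30, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(angle = c("upward", "downward"),
                  play_result = c("Single", "Out", "Foul Ball"))
)

# Pearson's chi-squared test of independence on the contingency table
res <- chisq.test(counts)

print(res$statistic)  # test statistic
print(res$parameter)  # degrees of freedom
print(res$p.value)    # p-value

# Decision rule used in the text: reject independence when p < 0.05
dependent <- res$p.value < 0.05
```

With samples as large as a full season of pitches, very small p-values are common, so the test indicates that a relationship exists rather than how strong it is.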

Splitting Training & Testing Data

# Filter df down to relevant columns
df <- select(df, c(play_result, exit_speed, pitch_type, hit_type, bearing, angle))

# Split into training and testing data, roughly 70% training and 30% testing
set.seed(1)

# Logical mask: each row is assigned to training with probability 0.7
is_train <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))

# Apply the mask to df
train  <- df[is_train, ]
test   <- df[!is_train, ]
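
Because each row is assigned independently, the split above is close to 70/30 but not exact. The sketch below, with a hypothetical row count n, checks the realized share and shows one alternative that draws exactly 70% of the row indices.

```r
set.seed(1)
n <- 1000  # hypothetical number of rows

# Notebook-style mask: each row lands in training with probability 0.7
is_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
print(mean(is_train))  # close to 0.7, but not exactly 0.7

# Alternative: sample exactly 70% of the row indices without replacement
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
print(length(train_idx))
```

Either approach is reasonable for a dataset of this size; the exact-size version is mainly useful when the split proportions need to be reported precisely.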

Model: Multinomial Logistic Regression for Classification

# Build the multinomial model using the selected features to predict play result
m1 <- multinom(play_result ~ log(exit_speed) + pitch_type + hit_type + bearing + angle, data=train)

summary(m1)
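
The summary of a multinom fit lists coefficients and standard errors but no p-values, so judging which variables matter takes one more step. The sketch below shows one common approach (Wald z-scores and exponentiated coefficients) on the built-in iris data, which stands in for the baseball data purely for illustration; fit is a hypothetical model, not m1.

```r
library(nnet)

# Built-in iris data used as a stand-in (assumption for illustration only)
fit <- multinom(Species ~ Sepal.Length, data = iris, trace = FALSE)
s <- summary(fit)

# Wald z-scores: one row per non-reference class, one column per coefficient
z <- s$coefficients / s$standard.errors

# Two-sided p-values from the normal approximation
p <- 2 * (1 - pnorm(abs(z)))
print(p)

# Exponentiated coefficients: multiplicative change in the odds of each class
# relative to the reference class for a one-unit increase in the predictor
print(exp(coef(fit)))
```

Each row is interpreted against the reference class (the first factor level), so for m1 the coefficients would describe each play result relative to whichever level of play_result sorts first.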

Predictions

# Make predictions on our testing data
test$predicted_outcomes <- predict(m1, test)
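
By default, predict on a multinom model returns only the most likely class. When the confidence behind a prediction matters, class probabilities are also available. The sketch below again uses iris as a stand-in, with hypothetical names fit and holdout in place of m1 and test.

```r
library(nnet)

# Built-in iris data stands in for train/test (assumption for illustration only)
set.seed(1)
is_train <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
fit <- multinom(Species ~ Sepal.Length, data = iris[is_train, ], trace = FALSE)
holdout <- iris[!is_train, ]

# type = "class" (the default) returns the most probable label for each row
pred_class <- predict(fit, newdata = holdout, type = "class")

# type = "probs" returns one probability per class; each row sums to 1
pred_probs <- predict(fit, newdata = holdout, type = "probs")
print(head(round(pred_probs, 3)))
```

The predicted class is simply the column with the highest probability in each row, so the two outputs always agree.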

Accuracy

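## [1] "Percent Outcomes Correct: 0.834476910030067 %"

The printed value is a proportion rather than a percentage: the model predicts about 83.4% of the test outcomes correctly.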
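
Accuracy of this kind is the share of test rows where the predicted label matches the actual one. A minimal sketch follows, with invented label vectors standing in for test$play_result and test$predicted_outcomes; it scales the result to a percentage and adds a confusion matrix to show where the errors fall.

```r
# Hypothetical actual and predicted labels (invented for illustration)
actual    <- factor(c("Out", "Out", "Single", "Foul Ball", "Single", "Out"))
predicted <- factor(c("Out", "Single", "Single", "Foul Ball", "Out", "Out"),
                    levels = levels(actual))

# Share of correct predictions; na.rm guards against rows the model could not score
accuracy <- mean(predicted == actual, na.rm = TRUE)
print(sprintf("Percent Outcomes Correct: %.1f%%", 100 * accuracy))

# Confusion matrix: rows are actual results, columns are predictions
confusion <- table(actual = actual, predicted = predicted)
print(confusion)
```

Since outs and foul balls are likely far more common than hits, comparing the accuracy against the share of the most frequent play result gives a useful baseline for how much the model adds.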